Lab 6 - Logistic Regression¶

Hamna Ashraf - 8826836¶

Get the data:

In [ ]:
from sklearn import datasets
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import plotly
plotly.offline.init_notebook_mode()
In [ ]:
iris = datasets.load_iris(as_frame=True)
In [ ]:
y = iris.target_names[iris.target] == 'virginica'

target_names = {0: 'non-virginica', 
                1: 'virginica'}

target = np.where(y, 1, 0)

data = iris.data

df_iris = pd.DataFrame(data, columns = iris.feature_names)
df_iris['target'] = target

df_iris['target_names'] = df_iris['target'].map(target_names)
df_iris
Out[ ]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) target target_names
0 5.1 3.5 1.4 0.2 0 non-virginica
1 4.9 3.0 1.4 0.2 0 non-virginica
2 4.7 3.2 1.3 0.2 0 non-virginica
3 4.6 3.1 1.5 0.2 0 non-virginica
4 5.0 3.6 1.4 0.2 0 non-virginica
... ... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 1 virginica
146 6.3 2.5 5.0 1.9 1 virginica
147 6.5 3.0 5.2 2.0 1 virginica
148 6.2 3.4 5.4 2.3 1 virginica
149 5.9 3.0 5.1 1.8 1 virginica

150 rows × 6 columns
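As a quick sanity check on the relabelling above, the binary target can be rebuilt in one step and its class balance verified (a minimal sketch using only `load_iris`):

```python
from sklearn import datasets
import numpy as np

iris = datasets.load_iris(as_frame=True)
# 1 for virginica (one of the three species), 0 for the other two
target = np.where(iris.target_names[iris.target] == 'virginica', 1, 0)

# each species has 50 samples, so the binary split should be 100 vs 50
counts = np.bincount(target)
print(counts)  # → [100  50]
```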

Explore Data:¶

Descriptive Statistics for Non-Virginica:

In [ ]:
desc_non_virginica = df_iris[df_iris['target_names'] == 'non-virginica'].describe()
desc_non_virginica
Out[ ]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) target
count 100.000000 100.000000 100.000000 100.000000 100.0
mean 5.471000 3.099000 2.861000 0.786000 0.0
std 0.641698 0.478739 1.449549 0.565153 0.0
min 4.300000 2.000000 1.000000 0.100000 0.0
25% 5.000000 2.800000 1.500000 0.200000 0.0
50% 5.400000 3.050000 2.450000 0.800000 0.0
75% 5.900000 3.400000 4.325000 1.300000 0.0
max 7.000000 4.400000 5.100000 1.800000 0.0

Descriptive Statistics for Virginica:

In [ ]:
desc_virginica = df_iris[df_iris['target_names'] == 'virginica'].describe()
desc_virginica
Out[ ]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) target
count 50.00000 50.000000 50.000000 50.00000 50.0
mean 6.58800 2.974000 5.552000 2.02600 1.0
std 0.63588 0.322497 0.551895 0.27465 0.0
min 4.90000 2.200000 4.500000 1.40000 1.0
25% 6.22500 2.800000 5.100000 1.80000 1.0
50% 6.50000 3.000000 5.550000 2.00000 1.0
75% 6.90000 3.175000 5.875000 2.30000 1.0
max 7.90000 3.800000 6.900000 2.50000 1.0
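The two per-class summaries above can also be produced with a single grouped call. A self-contained sketch (rebuilding the frame so it runs on its own) comparing the class means for all four features:

```python
from sklearn import datasets
import numpy as np

iris = datasets.load_iris(as_frame=True)
df = iris.data.copy()
df['target_names'] = np.where(iris.target_names[iris.target] == 'virginica',
                              'virginica', 'non-virginica')

# per-class means for all four features in one call
class_means = df.groupby('target_names').mean().round(3)
print(class_means)
```

The petal measurements separate the classes far more than the sepal measurements do, which matches the descriptive tables above.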

Histograms for Iris Features:

In [ ]:
fig, axes = plt.subplots(2, 2, figsize=(12, 8))

titles = ['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width']
for ax, feature, title in zip(axes.flat, iris.feature_names, titles):
    sns.histplot(data=df_iris, x=feature, hue="target_names", palette="bright",
                 alpha=0.6, hue_order=['virginica', 'non-virginica'], ax=ax)
    ax.set_title(title)

plt.tight_layout(rect=[0, 0.05, 1, 0.95])

Correlation matrix between the four features:

In [ ]:
df_iris_corr = df_iris.drop('target_names', axis=1)
# the string column must be dropped so that .corr() runs on numeric data only
In [ ]:
corr_values = df_iris_corr.corr().round(2)
sns.heatmap(corr_values, cmap="RdBu", annot=True, linewidth=.7)
Out[ ]:
<Axes: >
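The heatmap can be reduced to the single column used later for ordering the features: each feature's correlation with the binary target, strongest first (a self-contained sketch):

```python
from sklearn import datasets
import numpy as np

iris = datasets.load_iris(as_frame=True)
df = iris.data.copy()
df['target'] = np.where(iris.target_names[iris.target] == 'virginica', 1, 0)

# correlation of each feature with the binary target, strongest first
corr_with_target = df.corr()['target'].drop('target').sort_values(ascending=False)
print(corr_with_target.round(2))
```

This ordering (petal width, then petal length, then the sepal measurements) is the feature priority used when the models are fit below.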

The plots below are adapted from this Kaggle notebook:
https://www.kaggle.com/code/harshalgadhe/iris-species-prediction-using-ml-techniques/notebook

In [ ]:
def scatterplot(X,Y):
    sns.scatterplot(data=df_iris,x=X,y=Y,hue='target_names')
    plt.title(X+" vs "+Y)
    plt.grid()
    plt.show()


plt.figure(figsize=(10,6))
scatterplot('sepal width (cm)','sepal length (cm)')
plt.figure(figsize=(10,6))
scatterplot('petal width (cm)','petal length (cm)')

Insight:
The sepal width vs. sepal length plot above shows that for sepal lengths between roughly 5.5 and 6.5 cm the two classes overlap, making them hard to distinguish. The correlation matrix earlier showed that petal length and petal width correlate more strongly with the target, and the second plot confirms it: points with petal width above about 1.5 cm and petal length above about 5.3 cm are almost all virginica.

In [ ]:
sns.pairplot(df_iris_corr,hue='target',palette='deep')
#Target 0-> non-virginica
#       1-> virginica
Out[ ]:
<seaborn.axisgrid.PairGrid at 0x1f0d724db10>

Insights:
This pairplot shows every feature plotted against every other feature. One useful observation: in the petal length distributions, the two classes overlap in the 4.5-4.8 cm range, where their density curves intersect, so samples there are hard to classify from petal length alone.

The next graph is adapted from:
https://www.kaggle.com/code/abdmental01/iris-flower-eda-ml-beginner

In [ ]:
import plotly.express as px

custom_colors = ['#1f77b4', 'skyblue', 'cornflowerblue']
# Create a box plot to show the relationship between a numerical variable and a categorical variable
fig_boxplot = px.box(df_iris, x="target_names", y="petal width (cm)", title="Box Plot of Petal Width (cm) by Species",
                     color='target_names',  # Set box color based on Species
                     color_discrete_sequence=custom_colors  # Custom color sequence
                    )

# Update layout
fig_boxplot.update_layout(xaxis_title="Species", yaxis_title="Petal Width (cm)")

# Show the box plot
fig_boxplot.show()

Insights:
This box plot summarizes petal width for each class; hovering over a box shows its minimum, maximum, quartiles, and median.
In this dataset, every sample with petal width greater than 1.8 cm is virginica, and every sample with petal width less than 1.4 cm is non-virginica.
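The two thresholds claimed above can be checked directly against the data (a minimal verification sketch):

```python
from sklearn import datasets
import numpy as np

iris = datasets.load_iris(as_frame=True)
df = iris.data.copy()
df['target'] = np.where(iris.target_names[iris.target] == 'virginica', 1, 0)

wide = df[df['petal width (cm)'] > 1.8]    # claimed: all virginica
narrow = df[df['petal width (cm)'] < 1.4]  # claimed: all non-virginica

print((wide['target'] == 1).all(), (narrow['target'] == 0).all())  # → True True
```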

Split the data:¶

In [ ]:
# Split the data into training and temporary sets first
X_train, X_temp, y_train, y_temp = train_test_split(df_iris_corr, target, train_size=120, random_state=42)

# Split the temp set into validation and test sets
X_validate, X_test, y_validate, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

print("Training: ",X_train.shape[0]) 
print("Validate: ",X_validate.shape[0])
print("Test:", X_test.shape[0])
Training:  120
Validate:  15
Test: 15
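The split above is not stratified, so class proportions can drift between the sets. A variant using the `stratify` argument of `train_test_split` keeps the 2:1 non-virginica/virginica ratio in every split (a sketch, not the split actually used in this lab):

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
import numpy as np

iris = datasets.load_iris(as_frame=True)
X = iris.data
y = np.where(iris.target_names[iris.target] == 'virginica', 1, 0)

X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, train_size=120,
                                            stratify=y, random_state=42)
X_val, X_te, y_val, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5,
                                            stratify=y_tmp, random_state=42)

# 120 * (100/150) = 80 non-virginica and 120 * (50/150) = 40 virginica
print(np.bincount(y_tr))   # → [80 40]
print(np.bincount(y_val))  # → [10  5]
```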
In [ ]:
from sklearn.linear_model import LogisticRegression

# Feature order chosen from the correlation matrix:
# features are prioritized by the strength of their correlation with the target
features = ['petal width (cm)', 'petal length (cm)', 'sepal length (cm)','sepal width (cm)']

log_reg_1 = LogisticRegression(random_state=42)
log_reg_1.fit(X_train[features[0]].values.reshape(-1,1), y_train)

log_reg_2 = LogisticRegression(random_state=42)
log_reg_2.fit(X_train[features[:2]], y_train)

log_reg_3 = LogisticRegression(random_state=42)
log_reg_3.fit(X_train[features[:3]], y_train)

log_reg_4 = LogisticRegression(random_state=42)
log_reg_4.fit(X_train[features[:4]], y_train)

#Took help from class notebook
#https://github.com/HamnaAshraf1/CSCN8010/blob/main/class_notebooks/logistic_regression/logistic_regression.ipynb 
Out[ ]:
LogisticRegression(random_state=42)
In [ ]:
pred1 = log_reg_1.predict(X_validate[features[0]].values.reshape(-1,1))
predict_prob1 = log_reg_1.predict_proba(X_validate[features[0]].values.reshape(-1,1))[:,1]

pred2 = log_reg_2.predict(X_validate[features[:2]])
predict_prob2 = log_reg_2.predict_proba(X_validate[features[:2]])[:,1]

pred3 = log_reg_3.predict(X_validate[features[:3]])
predict_prob3 = log_reg_3.predict_proba(X_validate[features[:3]])[:,1]

pred4 = log_reg_4.predict(X_validate[features[:4]])
predict_prob4 = log_reg_4.predict_proba(X_validate[features[:4]])[:,1]

#Took help from class_notebook to see how we can use predict and predict_proba
#https://github.com/HamnaAshraf1/CSCN8010/blob/main/class_notebooks/logistic_regression/logistic_regression.ipynb 
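For binary logistic regression, `predict` and `predict_proba` are linked by a fixed 0.5 threshold. A small sketch (retrained on the full dataset with a single feature for brevity, so it is not identical to `log_reg_1` above) confirms the relationship:

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris(as_frame=True)
X = iris.data[['petal width (cm)']]
y = np.where(iris.target_names[iris.target] == 'virginica', 1, 0)

model = LogisticRegression(random_state=42).fit(X, y)

labels = model.predict(X)
proba = model.predict_proba(X)[:, 1]

# the hard label is just the class-1 probability thresholded at 0.5
print(np.array_equal(labels, (proba >= 0.5).astype(int)))  # → True
```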
In [ ]:
data1 = {
    'Instance Number': np.arange(1, len(y_validate) + 1),
    'Probability of Predicting Virginica': predict_prob1,  
    'Model Prediction': pred1,
    'Ground Truth': y_validate
}

table1 = pd.DataFrame(data1)
table1
Out[ ]:
Instance Number Probability of Predicting Virginica Model Prediction Ground Truth
0 1 0.007232 0 0
1 2 0.004901 0 0
2 3 0.925283 1 1
3 4 0.925283 1 1
4 5 0.350797 0 0
5 6 0.636206 1 1
6 7 0.636206 1 1
7 8 0.198051 0 0
8 9 0.003318 0 0
9 10 0.003318 0 0
10 11 0.925283 1 1
11 12 0.792805 1 1
12 13 0.004901 0 0
13 14 0.893300 1 1
14 15 0.198051 0 0
In [ ]:
data2 = {
    'Instance Number': np.arange(1, len(y_validate) + 1),
    'Probability of Predicting Virginica': predict_prob2,  
    'Model Prediction': pred2,
    'Ground Truth': y_validate
}

table2 = pd.DataFrame(data2)
table2
Out[ ]:
Instance Number Probability of Predicting Virginica Model Prediction Ground Truth
0 1 0.000014 0 0
1 2 0.000015 0 0
2 3 0.998655 1 1
3 4 0.897028 1 1
4 5 0.210752 0 0
5 6 0.585651 1 1
6 7 0.936985 1 1
7 8 0.150068 0 0
8 9 0.000009 0 0
9 10 0.000009 0 0
10 11 0.870248 1 1
11 12 0.782899 1 1
12 13 0.000009 0 0
13 14 0.952741 1 1
14 15 0.016507 0 0
In [ ]:
data3 = {
    'Instance Number': np.arange(1, len(y_validate) + 1),
    'Probability of Predicting Virginica': predict_prob3,  
    'Model Prediction': pred3,
    'Ground Truth': y_validate
}

table3 = pd.DataFrame(data3)
table3
Out[ ]:
Instance Number Probability of Predicting Virginica Model Prediction Ground Truth
0 1 0.000015 0 0
1 2 0.000011 0 0
2 3 0.997968 1 1
3 4 0.874726 1 1
4 5 0.214948 0 0
5 6 0.593969 1 1
6 7 0.930751 1 1
7 8 0.174016 0 0
8 9 0.000011 0 0
9 10 0.000011 0 0
10 11 0.827581 1 1
11 12 0.759765 1 1
12 13 0.000009 0 0
13 14 0.952516 1 1
14 15 0.017705 0 0
In [ ]:
data4 = {
    'Instance Number': np.arange(1, len(y_validate) + 1),
    'Probability of Predicting Virginica': predict_prob4,  
    'Model Prediction': pred4,
    'Ground Truth': y_validate
}

table4 = pd.DataFrame(data4)
table4
Out[ ]:
Instance Number Probability of Predicting Virginica Model Prediction Ground Truth
0 1 0.000009 0 0
1 2 0.000006 0 0
2 3 0.998534 1 1
3 4 0.873922 1 1
4 5 0.207005 0 0
5 6 0.572730 1 1
6 7 0.946564 1 1
7 8 0.170670 0 0
8 9 0.000008 0 0
9 10 0.000008 0 0
10 11 0.820222 1 1
11 12 0.728198 1 1
12 13 0.000004 0 0
13 14 0.956114 1 1
14 15 0.016196 0 0
In [ ]:
accuracy_model_1 = accuracy_score(table1['Ground Truth'], table1['Model Prediction'])
accuracy_model_2 = accuracy_score(table2['Ground Truth'], table2['Model Prediction'])
accuracy_model_3 = accuracy_score(table3['Ground Truth'], table3['Model Prediction'])
accuracy_model_4 = accuracy_score(table4['Ground Truth'], table4['Model Prediction'])


print(f"Accuracy of Model 1: {accuracy_model_1:.2f}")
print(f"Accuracy of Model 2: {accuracy_model_2:.2f}")
print(f"Accuracy of Model 3: {accuracy_model_3:.2f}")
print(f"Accuracy of Model 4: {accuracy_model_4:.2f}")

#I took help from ChatGpt for this step. I was not sure which measure to choose
Accuracy of Model 1: 1.00
Accuracy of Model 2: 1.00
Accuracy of Model 3: 1.00
Accuracy of Model 4: 1.00

Here we compare each model's predictions against the actual target values. In our case, all four models classified every validation sample correctly.

In [ ]:
decision_boundary = float(-log_reg_1.intercept_[0] / log_reg_1.coef_[0][0])

plt.axvline(x=decision_boundary, color='red', linestyle='-', label='Decision Boundary')
plt.xlabel(features[0])
plt.legend()
plt.show()
In [ ]:
decision_boundary_x1 = np.linspace(X_validate[features[0]].min(), X_validate[features[0]].max(), 10)
decision_boundary_x2 = -(log_reg_2.intercept_[0] / log_reg_2.coef_[0][1]) - ((log_reg_2.coef_[0][0] / log_reg_2.coef_[0][1]) * decision_boundary_x1)

plt.plot(decision_boundary_x1, decision_boundary_x2)
plt.xlabel(features[0])
plt.ylabel(features[1])
plt.show()
In [ ]:
x1, x2 = np.meshgrid(
    np.linspace(X_validate[features[0]].min(), X_validate[features[0]].max(), 10),
    np.linspace(X_validate[features[1]].min(), X_validate[features[1]].max(), 10))  # second axis must span feature 2's range, not feature 1's
decision_boundary_x3 = (-(log_reg_3.intercept_[0] / log_reg_3.coef_[0][2])
                        - (log_reg_3.coef_[0][0] / log_reg_3.coef_[0][2]) * x1
                        - (log_reg_3.coef_[0][1] / log_reg_3.coef_[0][2]) * x2)
In [ ]:
import plotly.graph_objects as go

surface = go.Surface(x=x1, y=x2, z=decision_boundary_x3, colorscale='Reds', opacity=0.5, showscale=False)

fig = go.Figure(data=[surface])

fig.update_layout(
    scene=dict(
        xaxis_title=features[0],
        yaxis_title=features[1],
        zaxis_title=features[2]
    ),
    title='Decision Boundary (Plane)'
)

fig.show()
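As a sanity check, any point on this plane should receive a predicted probability of exactly 0.5. A self-contained sketch (the model is retrained on the full dataset here for brevity, so its coefficients differ slightly from `log_reg_3`):

```python
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris(as_frame=True)
X = iris.data
y = np.where(iris.target_names[iris.target] == 'virginica', 1, 0)

features = ['petal width (cm)', 'petal length (cm)', 'sepal length (cm)']
model = LogisticRegression(random_state=42).fit(X[features], y)

w, b = model.coef_[0], model.intercept_[0]
x1, x2 = 1.5, 4.0                          # an arbitrary point in the first two features
x3 = -(b + w[0] * x1 + w[1] * x2) / w[2]   # solve w.x + b = 0 for the third feature

# on the decision plane the sigmoid of 0 gives probability 0.5
point = pd.DataFrame([[x1, x2, x3]], columns=features)
proba = model.predict_proba(point)[0, 1]
print(round(proba, 6))  # → 0.5
```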
In [ ]:
failed = np.where(pred1 != y_validate) 
print("Failure encountered at:",failed[0])
Failure encountered at: []

For the validation set, I could not find any failure patterns: every sample was classified correctly.
While analyzing predictions against actual values in the individual models, I noticed that the feature ordering matters: had the features not been sorted by their correlation with the target, some of the smaller models would have misclassified a few samples.

Best model recommendation:

To choose the best model, we first compare accuracy scores; here they are identical for all four models, so we look at the predicted probabilities instead.
Model 4 assigns probabilities closest to 0 for non-virginica samples and closest to 1 for virginica samples compared with the other models.
Therefore, model 4, which uses all four features, is recommended.
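The comparison of probability quality can be quantified with log loss, which scores the predicted probabilities rather than the hard labels (lower is better) and so can separate models that accuracy ties. A self-contained sketch reproducing the splits used above:

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

iris = datasets.load_iris(as_frame=True)
X = iris.data
y = np.where(iris.target_names[iris.target] == 'virginica', 1, 0)

X_train, X_temp, y_train, y_temp = train_test_split(X, y, train_size=120, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

features = ['petal width (cm)', 'petal length (cm)', 'sepal length (cm)', 'sepal width (cm)']

losses = []
for k in range(1, 5):
    model = LogisticRegression(random_state=42).fit(X_train[features[:k]], y_train)
    proba = model.predict_proba(X_val[features[:k]])[:, 1]
    losses.append(log_loss(y_val, proba))
    print(f"{k} feature(s): validation log loss = {losses[-1]:.4f}")
```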

In [ ]:
# test_acc1 = accuracy_score(y_test, log_reg_1.predict(X_test[features[0]].values.reshape(-1,1)))
# test_acc2 = accuracy_score(y_test, log_reg_2.predict(X_test[features[:2]]))
# test_acc3 = accuracy_score(y_test, log_reg_3.predict(X_test[features[:3]]))
test_acc4 = accuracy_score(y_test, log_reg_4.predict(X_test[features[:4]]))
print(f"Accuracy of Model 4: {test_acc4:.2f}")


# print(f"Accuracy of Model 1: {test_acc1:.2f}")
# print(f"Accuracy of Model 2: {test_acc2:.2f}")
# print(f"Accuracy of Model 3: {test_acc3:.2f}")
Accuracy of Model 4: 1.00